Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

recursiveloader: use less memory in assert_directory_verifies #2

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

zmedico
Copy link

@zmedico zmedico commented Dec 18, 2017

In order to avoid loading all manifest entries into memory
at once, make assert_directory_verifies use a shared sort_key
to iterate over directories and manifest entries in the same
order.

For verification of the entire gentoo tree, my tests have
shown that this change reduces the memory footprint by about
63%, while consuming about 20% more time.

@mgorny, maybe we can come up with something to accomplish
the goal of this patch while being a little less tricky?

@codecov-io
Copy link

codecov-io commented Dec 18, 2017

Codecov Report

Merging #2 into master will decrease coverage by 0.35%.
The diff coverage is 81.41%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master       #2      +/-   ##
==========================================
- Coverage    96.2%   95.84%   -0.36%     
==========================================
  Files          21       21              
  Lines        4928     5010      +82     
==========================================
+ Hits         4741     4802      +61     
- Misses        187      208      +21
Impacted Files Coverage Δ
gemato/recursiveloader.py 91.36% <81.41%> (-2.59%) ⬇️
gemato/openpgp.py 91.54% <0%> (-0.35%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 55d8019...a0ce724. Read the comment docs.

@mgorny
Copy link
Member

mgorny commented Dec 18, 2017

Is the memory use a real problem here? Speed is a problem, and 20% slowdown is not really acceptable.

@zmedico
Copy link
Author

zmedico commented Dec 18, 2017

Realistically, the memory consumption probably will not be an issue for the vast majority of systems where it makes sense to compile ebuilds from source. It's possible that loading all manifests into memory could be a scalability problem if the repository was large enough, but that doesn't seem likely to be a problem any time soon. Things like portage may choose to call gemato as a subprocess in order to avoid unnecessary memory allocation in the main process.

The 20% slowdown can be blamed on the sort_key=lambda p: p.split(os.sep) usage, but it should be possible to optimize that somehow.

@mgorny
Copy link
Member

mgorny commented Dec 18, 2017

Well, I was kinda thinking about that and decided that if it ever becomes a problem, we should probably use weak references to let Python unload unused Manifests. However, I've never looked into how well that would work.

As for the patch, it's really hard to read. If you still remember what the differences are, could you try updating it to avoid whitespace changes for now? I.e. if you dedented stuff, add if True: to keep them indented, and if you need to add indent, use 2-space tab to insert it between existing indentation.

In order to avoid loading all manifest entries into memory
at once, make assert_directory_verifies use a shared sort_key
to iterate over directories and manifest entries in the same
order.

For verification of the entire gentoo tree, my tests have
shown that this change reduces the memory footprint by about
63%, while consuming about 20% more time.
@zmedico zmedico force-pushed the iter-load-manifests-for-path branch from 3c3ac54 to a0ce724 Compare December 18, 2017 19:54
@zmedico
Copy link
Author

zmedico commented Dec 18, 2017

Modified indents to minimize diff.

@mgorny
Copy link
Member

mgorny commented Dec 18, 2017

Ok, I think I see what you're trying to do. I still think weak references are a better option here, presuming Python will free them for other objects.

@zmedico
Copy link
Author

zmedico commented Dec 18, 2017

It doesn't seem possible to implement get_file_entry_dict without having all the manifests loaded into memory at once, which is why I added the _iter_file_entry_dict method to use instead.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants